Restore memory queue's internal event cleanup after a batch is vended #41356

Merged
merged 2 commits into from
Oct 22, 2024

Conversation


faec commented Oct 21, 2024

Fix #41355, where event data in the memory queue was not being freed when event batches were acknowledged, but only gradually as the queue buffer was overwritten by later events. This gave the same effect as if all beat instances, even low-volume ones, were running with a full / saturated event queue.

The root cause, found by @swiatekm, is #39584, an unrelated cleanup of old code that accidentally removed one live call along with the deprecated ones. (There was an old FreeEntries hook in pipeline batches that was only used for deprecated shipper configs, but the cleanup also removed the FreeEntries call inside the queue, which was essential for releasing event memory.)
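
To illustrate the mechanism described above, here is a minimal sketch, not the actual Beats code. The `event`, `ringBuffer`, `batch`, and `liveEntries` names are hypothetical; only the `FreeEntries` name comes from this PR. It assumes the queue stores event pointers in a fixed-size buffer, and shows why clearing a batch's slots matters: without that call, the buffer keeps referencing old events until later writes overwrite them.

```go
package main

import "fmt"

// event stands in for a published event (hypothetical type).
type event struct {
	payload []byte
}

// ringBuffer is a hypothetical fixed-size buffer of event pointers.
type ringBuffer struct {
	entries []*event
	next    int // index where the next event will be written
}

func newRingBuffer(size int) *ringBuffer {
	return &ringBuffer{entries: make([]*event, size)}
}

// push stores an event, overwriting whatever occupied the slot before.
func (r *ringBuffer) push(e *event) {
	r.entries[r.next] = e
	r.next = (r.next + 1) % len(r.entries)
}

// liveEntries counts slots that still hold a reference to an event.
func (r *ringBuffer) liveEntries() int {
	n := 0
	for _, e := range r.entries {
		if e != nil {
			n++
		}
	}
	return n
}

// batch is a hypothetical view over a contiguous run of buffer slots.
type batch struct {
	buf   *ringBuffer
	start int
	count int
}

// FreeEntries drops the buffer's references to the batch's events so the
// garbage collector can reclaim them before the slots are reused.
func (b *batch) FreeEntries() {
	for i := 0; i < b.count; i++ {
		b.buf.entries[(b.start+i)%len(b.buf.entries)] = nil
	}
}

func main() {
	buf := newRingBuffer(8)
	for i := 0; i < 3; i++ {
		buf.push(&event{payload: make([]byte, 1024)})
	}
	b := &batch{buf: buf, start: 0, count: 3}

	fmt.Println("live before free:", buf.liveEntries()) // 3
	b.FreeEntries()
	fmt.Println("live after free:", buf.liveEntries()) // 0
}
```

If `FreeEntries` is never called in this sketch, the three events stay reachable from the buffer until five more pushes wrap around and overwrite them, which mirrors the "full queue" memory footprint described above.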

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Related issues

faec added the bug, Team:Elastic-Agent-Data-Plane, backport-8.15, backport-8.x, and backport-8.16 labels Oct 21, 2024
faec self-assigned this Oct 21, 2024
faec requested a review from a team as a code owner October 21, 2024 20:34
faec requested review from mauri870 and khushijain21 October 21, 2024 20:34
@elasticmachine

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)


jlind23 left a comment

Dummy question - Any reason why the disk queue FreeEntries method does not do anything?

pierrehilbert requested review from rdner and removed request for khushijain21 October 22, 2024 07:40

rdner left a comment

All changes make sense to me.

@@ -127,6 +133,10 @@ func (b *mockQueueBatch) Entry(i int) queue.Entry {
	return fmt.Sprintf("event %v", i)
}

func (b *mockQueueBatch) FreeEntries() {
	b.freeEntriesCalled++

Member

This can cause problems if the test code somehow runs in multiple goroutines, though it might not be a problem right now.

Contributor Author

True, though right now this struct is only used in a single synchronous call to newBatch.
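
For reference, if the mock were ever shared across goroutines, one possible way to make the counter safe is an atomic integer. This is only a sketch under that assumption: `mockBatch` below is a hypothetical stand-in for the test's `mockQueueBatch`, not the code in this PR, and `atomic.Int64` requires Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// mockBatch is a hypothetical stand-in for the test's mockQueueBatch.
type mockBatch struct {
	// freeEntriesCalled counts FreeEntries calls; atomic so that
	// concurrent callers do not race on the increment.
	freeEntriesCalled atomic.Int64
}

// FreeEntries records the call without a data race.
func (b *mockBatch) FreeEntries() {
	b.freeEntriesCalled.Add(1)
}

func main() {
	b := &mockBatch{}

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			b.FreeEntries()
		}()
	}
	wg.Wait()

	fmt.Println(b.freeEntriesCalled.Load()) // 100
}
```

With a plain `int` field and `++`, the same program would be flagged by `go test -race`; for a mock used from a single synchronous call, as in this PR, the plain counter is sufficient.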

mauri870 left a comment

LGTM.


faec commented Oct 22, 2024

@jlind23

Any reason why the disk queue FreeEntries method does not do anything?

Since the disk queue's main buffer isn't in memory, it isn't hit by this bug. Originally, using FreeEntries this way was a memory / GC optimization and the benchmarking focus was on the memory queue. We could implement FreeEntries in the disk queue also and probably get a moderate memory reduction, but it would require some refactoring in how the disk queue tracks acknowledgments. (I'd be happy to spend some time cleaning up the disk queue's ack handling but it's never been top priority :-) )
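
As a rough sketch of the distinction drawn here (the `queueBatch`, `memBatch`, and `diskBatch` names are hypothetical and not the Beats API; only `FreeEntries` is taken from the discussion): a memory-backed batch shares its slots with the queue's buffer, so freeing them releases event memory, while a disk-backed batch can satisfy the same interface with a no-op.

```go
package main

import "fmt"

// queueBatch is a hypothetical minimal interface for this sketch.
type queueBatch interface {
	Count() int
	FreeEntries()
}

// memBatch holds slots that share a backing array with the queue's buffer.
type memBatch struct {
	slots []*string
}

func (b *memBatch) Count() int { return len(b.slots) }

// FreeEntries clears the shared slots so the events become collectable.
func (b *memBatch) FreeEntries() {
	for i := range b.slots {
		b.slots[i] = nil
	}
}

// diskBatch holds events decoded from disk; the queue's main buffer is
// not in memory, so there is no shared buffer to clear.
type diskBatch struct {
	events []string
}

func (b *diskBatch) Count() int { return len(b.events) }

// FreeEntries is a no-op in this sketch.
func (b *diskBatch) FreeEntries() {}

func main() {
	a, c := "event A", "event B"
	shared := []*string{&a, &c} // stands in for the memory queue's buffer
	mem := &memBatch{slots: shared}
	disk := &diskBatch{events: []string{"event C"}}

	for _, qb := range []queueBatch{mem, disk} {
		qb.FreeEntries()
	}

	fmt.Println(shared[0] == nil, shared[1] == nil) // true true
	fmt.Println(len(disk.events))                   // 1
}
```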

faec merged commit fdb912a into elastic:main Oct 22, 2024
140 of 142 checks passed
faec deleted the event-memory-fix branch October 22, 2024 12:58
mergify bot pushed a commit that referenced this pull request Oct 22, 2024
…#41356)

Fix #41355, where event data in the memory queue was not being freed when event batches were acknowledged, but only gradually as the queue buffer was overwritten by later events. This gave the same effect as if all beat instances, even low-volume ones, were running with a full / saturated event queue.

The root cause, found by @swiatekm, is [this PR](#39584), an unrelated cleanup of old code that accidentally included one live call along with the deprecated ones. (There was an old `FreeEntries` hook in pipeline batches that was only used for deprecated shipper configs, but the cleanup also removed the `FreeEntries` call _inside_ the queue which was essential for releasing event memory.)

(cherry picked from commit fdb912a)
mergify bot pushed a commit that referenced this pull request Oct 22, 2024
…#41356)

(cherry picked from commit fdb912a)
mergify bot pushed a commit that referenced this pull request Oct 22, 2024
…#41356)

(cherry picked from commit fdb912a)
faec added a commit that referenced this pull request Oct 22, 2024
…#41356) (#41364)

(cherry picked from commit fdb912a)

Co-authored-by: Fae Charlton <fae.charlton@elastic.co>
faec added a commit that referenced this pull request Oct 22, 2024
…#41356) (#41363)

(cherry picked from commit fdb912a)

Co-authored-by: Fae Charlton <fae.charlton@elastic.co>
faec added a commit that referenced this pull request Oct 22, 2024
…#41356) (#41362)

(cherry picked from commit fdb912a)

Co-authored-by: Fae Charlton <fae.charlton@elastic.co>

cmacknz commented Oct 22, 2024

@faec can you create an issue describing the benchmark we would have needed to catch this before release?

Successfully merging this pull request may close these issues.

Queue keeps stale event data in memory in 8.15
7 participants